38 research outputs found

    Automatic Detection of Performance Anomalies in Task-Parallel Programs

    To efficiently exploit the resources of new many-core architectures, integrating dozens or even hundreds of cores per chip, parallel programming models have evolved to expose massive amounts of parallelism, often in the form of fine-grained tasks. Task-parallel languages, such as OpenStream, X10, Habanero Java and C or StarSs, simplify the development of applications for new architectures, but tuning task-parallel applications remains a major challenge. Performance bottlenecks can occur at any level of the implementation, from the algorithmic level (e.g., lack of parallelism or over-synchronization), to interactions with the operating and runtime systems (e.g., data placement on NUMA architectures), to inefficient use of the hardware (e.g., frequent cache misses or misaligned memory accesses); detecting such issues and determining the exact cause is a difficult task. In previous work, we developed Aftermath, an interactive tool for trace-based performance analysis and debugging of task-parallel programs and run-time systems. In contrast to other trace-based analysis tools, such as Paraver or Vampir, Aftermath offers native support for tasks, i.e., visualization, statistics and analysis tools adapted for performance debugging at task granularity. However, the tool currently does not provide support for the automatic detection of performance bottlenecks, and it is up to the user to investigate the relevant aspects of program execution by focusing the inspection on specific slices of a trace file. In this paper, we present ongoing work on two extensions that guide the user through this process. Comment: Presented at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281).
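
    As a rough illustration of the kind of guidance such extensions could provide, the Python sketch below flags tasks whose duration is far from the typical duration of their task type in a trace. The trace record format, the flag_slow_tasks helper and the median/MAD threshold are assumptions made for illustration only; they are not Aftermath's actual detection algorithm.

from collections import defaultdict
from statistics import median

def flag_slow_tasks(trace, k=5.0):
    """trace: list of (task_type, start_ns, duration_ns) records.
    Flags records whose duration exceeds median + k * MAD for their type."""
    durations_by_type = defaultdict(list)
    for task_type, _start, duration in trace:
        durations_by_type[task_type].append(duration)

    anomalies = []
    for task_type, start, duration in trace:
        ds = durations_by_type[task_type]
        med = median(ds)
        mad = median(abs(d - med) for d in ds) or 1.0  # avoid a zero threshold
        if duration > med + k * mad:
            anomalies.append((task_type, start, duration))
    return anomalies

# Example: the third "stencil" task is an order of magnitude slower than its peers.
trace = [("stencil", 0, 100), ("stencil", 100, 110), ("stencil", 210, 900),
         ("reduce", 0, 50), ("reduce", 60, 55)]
print(flag_slow_tasks(trace))   # -> [('stencil', 210, 900)]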

    Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

    We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about inter-task communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.
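
    The Python sketch below illustrates, under simplifying assumptions, how a dependence-aware scheduler might combine a static NUMA distance matrix with runtime knowledge of where a task's input bytes reside. The place_task helper and its scoring rule are hypothetical and are not the algorithm presented in the paper.

def place_task(input_bytes_per_node, distance):
    """input_bytes_per_node: {numa_node: bytes of the task's input residing there}
    distance: distance[i][j] = relative cost for node i to access node j.
    Returns the NUMA node that minimizes distance-weighted input traffic."""
    def cost(candidate):
        return sum(distance[candidate][src] * nbytes
                   for src, nbytes in input_bytes_per_node.items())
    return min(range(len(distance)), key=cost)

# Example: a 4-node machine; most of the task's input lives on node 2,
# so the heuristic places the task there.
distance = [[10, 20, 20, 30],
            [20, 10, 30, 20],
            [20, 30, 10, 20],
            [30, 20, 20, 10]]
print(place_task({0: 4096, 2: 1 << 20}, distance))   # -> 2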

    Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management

    Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
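
    The toy Python model below illustrates the write-local invariant described above: a task's output buffer is allocated on the NUMA node of the worker that executes (and therefore writes) it, so output accesses are always local. The Worker and allocate_on_node interfaces are hypothetical stand-ins for run-time and OS facilities, not the actual implementation.

from dataclasses import dataclass

@dataclass
class Buffer:
    node: int          # NUMA node holding the pages
    data: bytearray

def allocate_on_node(size, node):
    # Stand-in for a node-local allocation (e.g. first-touch or an mbind-style policy).
    return Buffer(node=node, data=bytearray(size))

@dataclass
class Worker:
    node: int          # NUMA node of the core running this worker

    def run_task(self, task_body, output_size):
        # Invariant: the output buffer is placed on this worker's node before
        # the task body writes it, so every output access is local.
        out = allocate_on_node(output_size, self.node)
        task_body(out)
        return out

# Example: a task executed by a worker on node 3 produces a node-3 buffer.
worker = Worker(node=3)
result = worker.run_task(lambda buf: buf.data.__setitem__(0, 42), 64)
print(result.node, result.data[0])   # -> 3 42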

    Direct-mapped versus set-associative pipelined caches

    Available in the files attached to this document.

    Architecture des ordinateurs


    Etude de quelques organisations d'antémémoires

    The performance of microprocessor-based systems depends more and more on the performance of the memory hierarchy, and in particular on that of the caches. Indeed, over recent years the processor cycle time has decreased much faster than the main memory access time. This trend has increased the importance of caches and of their efficiency. In this report, we present three new organizations for caches located on the same chip as the processor, designed to maximize microprocessor performance. The first organization is derived from the unified cache organization, to which an instruction buffer has been added. The second organization combines a unified instruction and data cache with a separate instruction cache. Finally, we present the semi-unified cache organization. It consists of two physically separate caches, C1 and C2, both intended to hold instructions and data. Cache C1 is at once the primary cache for instructions and the secondary cache for data (C2 is the primary cache for data and the secondary cache for instructions). The associativity degree for data and instructions is thus artificially increased, and the storage space is shared dynamically between instructions and data. Semi-unified caches reduce the cache miss rate compared with the cache organizations commonly used in microprocessors.

    MODEE: smoothing branch and instruction cache miss penalties on deep pipelines

    Pipelining is a major technique used in high-performance processors, but a fundamental drawback of pipelining is the time lost on branch instructions. A new organization for implementing branch instructions is presented: the Multiple Instruction Decode Effective Execution (MIDEE) organization. All pipeline depths may be addressed using this organization. MIDEE is based on the use of double fetch and decode, early computation of the target address for branch instructions, and two instruction queues. The double fetch-decode concerns a pair of instructions stored at consecutive addresses. These instructions are decoded simultaneously, but no execution hardware is duplicated; only useful instructions are effectively executed. A pair of instruction queues is used between the fetch-decode stages and the execution stages, which allows the branch penalty and most of the instruction cache miss penalty to be hidden. Trace-driven simulations show that the performance of deep-pipeline processors may be dramatically improved when the MIDEE organization is implemented: the branch penalty is reduced and the pipeline stall delay due to instruction cache misses is also decreased.
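
    The short Python model below schematically reproduces the double fetch-decode idea: pairs of instructions at consecutive addresses are decoded together, branch targets are resolved at decode time, and only instructions on the taken path enter the queue feeding the execution stages. It is a behavioural illustration of the principle under simplifying assumptions (unconditional branches only), not the MIDEE hardware design.

def fetch_decode(program, start_pc, queue_capacity=8):
    """program: {pc: instruction}, where ('br', target) is an unconditional branch.
    Returns the instructions pushed into the queue feeding the execution stages."""
    queue, pc = [], start_pc
    while pc in program and len(queue) < queue_capacity:
        # Double fetch: two instructions at consecutive addresses per cycle.
        pair = [(pc, program.get(pc)), (pc + 1, program.get(pc + 1))]
        next_pc = pc + 2
        for addr, insn in pair:
            if insn is None:
                continue
        queue.append((addr, insn)) if False else queue.append((addr, insn))
    return queue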

    Semi-unified caches

    Since the gap between main memory access time and processor cycle time is continuously increasing, processor performance depends dramatically on the behavior of caches, and particularly on the behavior of small on-chip caches. In this paper, we present a new organization for on-chip caches: the semi-unified cache organization. In most microprocessors, two physically split caches are used for storing data and instructions respectively. The purpose of the semi-unified cache organization is to use the data cache (resp. instruction cache) as an on-chip second-level cache for instructions (resp. data). Thus the associativity degree of both on-chip caches is artificially increased, and the cache spaces respectively devoted to instructions and data are dynamically adjusted. The off-chip miss ratio of a semi-unified cache built with two direct-mapped caches of size S is equal to the miss ratio of a unified two-way set-associative cache of size 2S; yet the hit time of this semi-unified cache is equal to the hit time of a direct-mapped cache; moreover, both instructions and data may be accessed in parallel, as in the split data/instruction cache organization. Since the on-chip miss penalty is lower than the off-chip miss penalty, trace-driven simulations show that using a direct-mapped semi-unified cache organization leads to higher overall system performance than using the usual split instruction/data cache organization.
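
    The Python sketch below is a simplified functional model of the lookup path described above, assuming two direct-mapped caches of S lines each in which the primary cache for one access type doubles as an on-chip second-level cache for the other. On a secondary hit the block is swapped into the primary cache; the replacement details of the real design may differ, so this only illustrates the lookup order and the artificially increased associativity.

class DirectMapped:
    """A direct-mapped cache holding one block address tag per line."""
    def __init__(self, lines):
        self.lines = lines
        self.tags = [None] * lines

    def _slot(self, block):
        return block % self.lines

    def hit(self, block):
        return self.tags[self._slot(block)] == block

    def fill(self, block):
        """Install block, returning the displaced block (or None)."""
        s = self._slot(block)
        victim, self.tags[s] = self.tags[s], block
        return victim

class SemiUnified:
    def __init__(self, lines):
        self.c_insn = DirectMapped(lines)   # primary for instructions, secondary for data
        self.c_data = DirectMapped(lines)   # primary for data, secondary for instructions

    def access(self, block, is_insn):
        primary = self.c_insn if is_insn else self.c_data
        secondary = self.c_data if is_insn else self.c_insn
        if primary.hit(block):
            return "primary hit"
        if secondary.hit(block):
            # On-chip second-level hit: promote the block into the primary
            # cache and demote the displaced block into the secondary cache.
            displaced = primary.fill(block)
            if displaced is not None:
                secondary.fill(displaced)
            return "secondary hit (on-chip)"
        # Off-chip miss: fetch into the primary cache; the displaced block
        # stays on chip in the secondary cache.
        displaced = primary.fill(block)
        if displaced is not None:
            secondary.fill(displaced)
        return "off-chip miss"

# Two instruction blocks that conflict in a single direct-mapped cache can
# both stay on chip, as in a two-way set-associative cache of size 2S.
cache = SemiUnified(lines=4)
print(cache.access(0, is_insn=True))   # off-chip miss
print(cache.access(4, is_insn=True))   # off-chip miss, block 0 demoted on chip
print(cache.access(0, is_insn=True))   # secondary hit (on-chip)
print(cache.access(4, is_insn=True))   # secondary hit (on-chip)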